
    Digital sound synthesis via parallel evolutionary optimization

    In this research, we propose a novel parallelizable architecture for optimizing the parameters of various sound synthesizers. The architecture employs genetic algorithms to match the parameters of different sound synthesizer topologies to target sounds, and the fitness function is evaluated in parallel to reduce convergence time. Based on the proposed architecture, we have implemented a framework in the SuperCollider audio synthesis and programming environment and conducted several experiments. The results show that the framework can estimate sound synthesis parameters accurately at promising speeds.
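
    To make the approach concrete, here is a minimal sketch in Python rather than the authors' SuperCollider framework: an evolutionary loop (selection and mutation only, crossover omitted for brevity) matches the parameters of a toy two-operator FM synthesizer to a target sound, and fitness is evaluated in parallel with multiprocessing. The synthesizer, parameter ranges, and spectral-distance fitness are illustrative assumptions.

    import numpy as np
    from multiprocessing import Pool

    SR, DUR = 16000, 0.5
    T = np.linspace(0, DUR, int(SR * DUR), endpoint=False)

    def fm_synth(params):
        """Render a two-operator FM tone: carrier freq, mod freq, mod index."""
        carrier, mod, index = params
        return np.sin(2 * np.pi * carrier * T + index * np.sin(2 * np.pi * mod * T))

    # Target spectrum from known parameters, so the estimate can be checked.
    TARGET_SPECTRUM = np.abs(np.fft.rfft(fm_synth((440.0, 110.0, 2.0))))

    def fitness(params):
        """Negative spectral distance between candidate and target sound."""
        spectrum = np.abs(np.fft.rfft(fm_synth(params)))
        return -np.linalg.norm(spectrum - TARGET_SPECTRUM)

    def evolve(pop_size=64, generations=50, sigma=(40.0, 10.0, 0.3)):
        rng = np.random.default_rng(0)
        pop = rng.uniform((100, 20, 0), (2000, 500, 10), size=(pop_size, 3))
        with Pool() as pool:                       # parallel fitness evaluation
            for _ in range(generations):
                scores = np.array(pool.map(fitness, pop))
                elite = pop[np.argsort(scores)[-pop_size // 4:]]   # selection
                parents = elite[rng.integers(0, len(elite), pop_size)]
                pop = parents + rng.normal(0, sigma, parents.shape)  # mutation
        return pop[np.argmax([fitness(p) for p in pop])]

    if __name__ == "__main__":
        print("estimated parameters:", evolve())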

    Using image morphing for memory-efficient impostor rendering on GPU

    Real-time rendering of large animated crowds consisting of thousands of virtual humans is important for several applications, including simulations, games, and interactive walkthroughs, but cannot be performed with complex polygonal models at interactive frame rates. For that reason, several methods using large numbers of pre-computed image-based representations, called impostors, have been proposed. These methods take advantage of programmable graphics hardware to compensate for the computational expense while maintaining visual fidelity, so the number of different virtual humans that can be rendered in real time is no longer restricted by the required computational power but by the texture memory consumed for the variety and discretization of their animations. In this work, we propose an alternative method that reduces memory consumption by generating compelling intermediate textures with image-morphing techniques. To demonstrate the preserved perceptual quality of animations in which half of the key-frames were synthesized with the proposed methodology, we implemented the system on the graphics processing unit and obtained promising results at interactive frame rates.
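
    As a rough illustration of the idea (not the paper's GPU implementation, which uses image morphing rather than plain blending), the following sketch synthesizes an in-between impostor texture from two stored key-frames, so that every other key-frame can be dropped from texture memory and re-created on demand.

    import numpy as np

    def intermediate_texture(keyframe_a, keyframe_b, t):
        """Blend two RGBA key-frame textures at interpolation factor t in [0, 1]."""
        a = keyframe_a.astype(np.float32)
        b = keyframe_b.astype(np.float32)
        return ((1.0 - t) * a + t * b).round().astype(np.uint8)

    # Re-synthesizing the middle frame roughly halves animation storage.
    frame0 = np.zeros((64, 64, 4), dtype=np.uint8)
    frame2 = np.full((64, 64, 4), 255, dtype=np.uint8)
    frame1 = intermediate_texture(frame0, frame2, 0.5)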

    Augmenting conversations through context-aware multimedia retrieval based on speech recognition

    Future environments will be sensitive and responsive to the presence of people, supporting them in carrying out their everyday activities, tasks, and rituals in an easy and natural way. Such interactive spaces will use information and communication technologies to bring computation into the physical world and enhance the ordinary activities of their users. This paper describes a speech-based multimedia retrieval system that presents relevant video-podcast (vodcast) footage in response to spontaneous speech and conversations during daily-life activities. The proposed system allows users to search the spoken content of multimedia files rather than their associated meta-information and lets them navigate to the exact portion where the queried words are spoken, facilitating within-medium searches of multimedia content through a bag-of-words approach. Finally, we have evaluated the proposed system on vodcasts in English from various categories and discussed how it could enhance people's everyday activities in different scenarios, including education, entertainment, marketing, news, and the workplace.
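
    A minimal sketch of the within-medium search component under a bag-of-words model might look as follows: an inverted index maps each recognized word to the vodcasts and time offsets where it is spoken. The word timings are assumed to come from a speech recognizer, and all identifiers and data here are illustrative.

    from collections import defaultdict

    index = defaultdict(list)  # word -> [(vodcast_id, seconds), ...]

    def add_transcript(vodcast_id, timed_words):
        """Index (word, start_time_in_seconds) pairs from one vodcast."""
        for word, start in timed_words:
            index[word.lower()].append((vodcast_id, start))

    def search(query):
        """Return (vodcast, offset) hits for every word in the spoken query."""
        return {w: index.get(w.lower(), []) for w in query.split()}

    add_transcript("lecture_01", [("neural", 12.4), ("networks", 12.9)])
    print(search("neural networks"))
    # {'neural': [('lecture_01', 12.4)], 'networks': [('lecture_01', 12.9)]}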

    A decision forest based feature selection framework for action recognition from RGB-Depth cameras

    In this paper, we present an action recognition framework that leverages the data mining capabilities of random decision forests trained on kinematic features. We describe human motion via a rich collection of kinematic feature time-series computed from the skeletal representation of the body in motion. We discriminatively optimize a random decision forest model over this collection to identify the most effective subset of features, localized both in time and space. We then train a support vector machine classifier on the selected features. This approach improves upon the baseline performance obtained with the whole feature set while using significantly fewer features (one tenth of the original). On the MSRC-12 dataset (12 classes), our method achieves 94% accuracy; on the WorkoutSU-10 dataset collected by our group (10 physical exercise classes), the accuracy is 98%. The approach can also provide insights into the spatiotemporal dynamics of human actions.
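
    A generic scikit-learn stand-in for this select-then-classify pipeline is sketched below: a random forest ranks feature importances, the top tenth of the features is kept, and an SVM is trained on that subset. The synthetic data substitutes for the kinematic feature time-series, and the forest's global importances only approximate the paper's discriminative, time- and space-localized selection.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 200))        # 600 clips x 200 kinematic features
    y = rng.integers(0, 12, size=600)      # 12 action classes, as in MSRC-12

    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    top = np.argsort(forest.feature_importances_)[-X.shape[1] // 10:]  # top 10%

    svm = SVC(kernel="rbf").fit(X[:, top], y)  # classify on selected features
    print("selected feature indices:", sorted(top))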

    NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning

    This paper introduces NoRefER, a novel referenceless quality metric for automatic speech recognition (ASR) systems. Traditional reference-based metrics for evaluating ASR systems require costly ground-truth transcripts. NoRefER overcomes this limitation by fine-tuning a multilingual language model to rank ASR hypotheses pair-wise, using contrastive learning with a Siamese network architecture. The self-supervised variant of NoRefER exploits the known quality relationships between hypotheses from multiple compression levels of an ASR model to learn intra-sample ranking of hypotheses by quality, which is essential for model comparisons. The semi-supervised version also uses a referenced dataset to improve inter-sample quality ranking, which is crucial for selecting potentially erroneous samples. The results indicate that NoRefER correlates highly with reference-based metrics and their intra-sample ranks, indicating high potential for referenceless ASR evaluation and A/B testing.
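
    The following PyTorch sketch illustrates pair-wise ranking with a Siamese scorer and a margin loss, which is the training signal described above. The tiny bag-of-embeddings encoder is a stand-in assumption; NoRefER fine-tunes a pre-trained multilingual language model instead.

    import torch
    import torch.nn as nn

    class SiameseScorer(nn.Module):
        def __init__(self, vocab_size=1000, dim=64):
            super().__init__()
            self.embed = nn.EmbeddingBag(vocab_size, dim)  # stand-in encoder
            self.head = nn.Linear(dim, 1)                  # quality score

        def forward(self, token_ids):
            return self.head(self.embed(token_ids)).squeeze(-1)

    model = SiameseScorer()
    loss_fn = nn.MarginRankingLoss(margin=0.1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Hypothesis pairs: 'better' comes from a lower compression level of the
    # ASR than 'worse', so its quality is known to be at least as high.
    better = torch.randint(0, 1000, (32, 20))  # batch of token-id sequences
    worse = torch.randint(0, 1000, (32, 20))

    score_b, score_w = model(better), model(worse)   # shared weights (Siamese)
    loss = loss_fn(score_b, score_w, torch.ones(32)) # rank better above worse
    loss.backward()
    optimizer.step()
    print("pair-wise ranking loss:", loss.item())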

    A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision

    The common standard for quality evaluation of automatic speech recognition (ASR) systems is reference-based metrics such as the Word Error Rate (WER), computed from manual ground-truth transcriptions that are time-consuming and expensive to obtain. This work proposes a multi-language referenceless quality metric that allows comparing the performance of different ASR models on a speech dataset without ground-truth transcriptions. To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised manner. In experiments conducted on several unseen test datasets consisting of outputs from top commercial ASR engines in various languages, the proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from a state-of-the-art multilingual LM in all experiments, and it also reduces WER by more than 7% when used for ensembling hypotheses. The fine-tuned model and experiments are available for reproducibility at https://github.com/aixplain/NoRefER (arXiv admin note: substantial text overlap with arXiv:2306.1257).
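
    Once trained, such a metric can be used roughly as sketched below: score competing ASR hypotheses without transcripts, keep the best-scoring hypothesis per utterance (the ensembling use), and, where references do exist, check that the scores correlate negatively with WER. The scorer and all numbers here are placeholders, not outputs of the released model.

    from scipy.stats import spearmanr

    def select_best(hypotheses, score):
        """Pick the hypothesis with the highest referenceless quality score."""
        return max(hypotheses, key=score)

    score = lambda hyp: -len(hyp)          # placeholder scorer, an assumption
    print(select_best(["helo world", "hello world!?"], score))

    # A useful metric should correlate negatively with WER
    # (higher quality score, lower error rate).
    metric_scores = [0.91, 0.40, 0.75, 0.22]
    wers = [0.05, 0.38, 0.12, 0.51]
    rho, _ = spearmanr(metric_scores, wers)
    print("Spearman rho:", rho)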

    An Interface for Emotional Expression in Audio-Visuals

    In this work, a comprehensive study is performed on the relationship between audio, visuals, and emotion by applying the principles of cognitive emotion theory to digital creation. The study is driven by an audiovisual emotion library project named AVIEM, which provides an interactive interface for experimentation on and evaluation of the perception and creation processes of audiovisuals. AVIEM primarily consists of separate audio and visual libraries and grows with user contribution as users explore different combinations between them. The library offers a wide range of experimentation possibilities by allowing users to create audiovisual relations and by logging their emotional responses through its interface. Besides being a resourceful experimentation tool, AVIEM aims to become a source of inspiration, where digitally created abstract virtual environments and soundscapes can elicit target emotions at a preconscious level by building genuine audiovisual relations that engage the viewer on a strong emotional level. Lastly, various schemes are proposed to visualize the information gathered through AVIEM, to improve navigation and to reveal trends and dependencies among audiovisual relations.

    Digital music performance for mobile devices based on magnetic interaction

    Digital music performance requires a high degree of interaction with input controllers that can provide fast feedback on the user's actions. One of the primary considerations of professional artists is a powerful and creative tool that minimizes the number of steps required for speed-demanding processes. Nowadays, mobile devices have become popular digital instruments for musical performance. Most applications designed for mobile devices use the touch screen, keypad, or accelerometer as interaction modalities. In this paper, we present a novel interface for musical performance based on magnetic interaction between the user and the device. The proposed method constitutes a touchless interaction modality based on the mutual effect between the magnetic field surrounding a device and that of a properly shaped magnet. Extending the interaction space beyond the physical boundary of the device gives the user a higher degree of flexibility for musical performance, which in turn can open the door to a wide spectrum of new functionalities in digital music performance and production.
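
    As a simulated illustration of the touchless mapping (real devices would read the field through a platform sensor API), the sketch below converts 3-axis magnetometer readings into a musical control value as a magnet moves near the device; the readings and ranges are illustrative assumptions.

    import math

    def field_magnitude(x, y, z):
        """Overall magnetic field strength in microtesla."""
        return math.sqrt(x * x + y * y + z * z)

    def to_midi_pitch(magnitude, lo=50.0, hi=500.0):
        """Map field strength onto a MIDI note range, clamped to [36, 84]."""
        frac = min(max((magnitude - lo) / (hi - lo), 0.0), 1.0)
        return int(36 + frac * 48)

    for reading in [(20, 5, 45), (120, 80, 60), (300, 250, 200)]:
        mag = field_magnitude(*reading)
        print(f"field {mag:7.1f} uT -> MIDI note {to_midi_pitch(mag)}")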

    Real-time feature-based image morphing for memory-efficient impostor rendering and animation on GPU

    Real-time rendering of large animated crowds consisting of thousands of virtual humans is important for several applications, including simulations, games, and interactive walkthroughs, but cannot be performed with complex polygonal models at interactive frame rates. For that reason, methods using large numbers of precomputed image-based representations, called impostors, have been proposed. These methods take advantage of programmable graphics hardware to compensate for the computational expense while maintaining visual fidelity. Thanks to these methods, the number of different virtual humans rendered in real time is no longer restricted by computational power but by the texture memory consumed for the variety and discretization of their animations. This work proposes a resource-efficient impostor rendering methodology that employs image-morphing techniques to reduce memory consumption while preserving perceptual quality, thus allowing higher diversity or resolution of the rendered crowds. Results of the experiments indicate that, in comparison with conventional impostor rendering techniques, the proposed method can obtain 38% smoother animations or 87% better appearance quality by reducing the number of key-frames required to preserve animation quality, resynthesizing them with up to 92% similarity in real time.
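
    A back-of-the-envelope calculation shows why halving the stored key-frames matters; all sizes below are illustrative assumptions rather than figures from the paper.

    def texture_memory_mb(variants, animations, keyframes, views,
                          w=128, h=256, bytes_per_texel=4):
        """Total impostor texture memory for a crowd's animation atlas."""
        texels = variants * animations * keyframes * views * w * h
        return texels * bytes_per_texel / (1024 ** 2)

    full = texture_memory_mb(variants=10, animations=4, keyframes=16, views=8)
    halved = texture_memory_mb(variants=10, animations=4, keyframes=8, views=8)
    print(f"all key-frames stored : {full:8.1f} MB")
    print(f"half morphed on GPU   : {halved:8.1f} MB")  # budget freed for variety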